2/5/98 UCLA Data Mining Short Course (3) 1

Integrated Data Mining Systems

Wei-Min Shen

Information Sciences Institute

University of Southern California

Outline

• Objectives for Integrated System

• System Architecture

• Necessary Capabilities

• Representation Languages

• Actual System Descriptions

Objectives for Integrated KDD Systems

• Carry out the entire KDD process

– Data selection

– Data preprocessing

– Data transformation

– Data mining

– Interpretation and evaluation

• Coherently integrate complementary techniques

• Amplify human capabilities (e.g. see a lot)

• Allow human to control the KDD process

System Architecture

• Necessary elements

– Access to existing data sets or databases

– Representation and storage of knowledge

– Basic data mining techniques

• Deduction

• Induction

• Visualization

• Use of human guidance

Deduction

• A rigid inference procedure from the general to the specific

– “All computers have a CPU” and “X is a computer”, therefore “X has a CPU”

• Seek evidence for a general hypothesis

– “Maybe all computers have a CPU”

– “Check how many computers in my database have a CPU”

Induction

• A “not so rigid” inference procedure from the specific to the general

– “I drove yesterday,” “you drove yesterday,” “he drove yesterday,” ...

– “Everyone drove yesterday”

• Seek general patterns from data

• There are many popular induction methods

– Decision trees, rules and lists, NN, ILP, ...

Visualization

• Allow humans to see very large amounts of data in one visual field

• Provide clues for abstractions by humans

The Use of Human Guidance

• The need for human guidance

– Large amount of data

– Large search space for possible patterns

– Machines do not have human intuition yet

• How to encode human knowledge into data mining process?

Representation

• Languages for data access and manipulation

– SQL, Datalog, LDL++, Cobol, C++, ...

• Languages for representing knowledge

– Prolog, LDL++, Loom, ...

• Prefer languages that serve multiple purposes

Examples of Integrated Systems

• IBM’s Intelligent Miner, Advanced Scout

• Recon

• DBMiner

• DataCrystal

• many more

Advanced Scout

• A system that helps NBA coaches find and use patterns hidden in historical game data

• Example patterns

– “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots”

• Widely used by many NBA teams, and coaches say that “it is written with coach in mind”

• Inputs: Relational databases

• Outputs: Rule-based models

• Integrate induction, deduction, visualization

Recon Architecture

[Architecture diagram. Components: Graphical User Interface; Command Module; Deductive Database; Rule Induction; Visualization; Recon Server; External DB; Target DB; Knowledge Repository]

Recon Visualization

• Obtain a global view of a data set

– a view of tables and columns

• Notice important phenomena that hold on subsets of data

– Clusters

– Trends

– Correlation

Recon Deductive Database

• Define concepts

– high-growth: earnings-per-share-growth > 50% and dividend-growth > 50%

• Allow new concepts to be defined on the existing ones

• Effect: prepare subsets of data for further analysis
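
A minimal Python sketch of the idea, not Recon's actual language: concepts are modeled as predicates over records, a new concept is built on an existing one, and the concept then selects a subset of the data. The field names, the sample records, and the `safe_high_growth` concept are hypothetical, chosen only for illustration.

```python
# Hypothetical stock records; growth values are fractions (0.62 means 62%).
STOCKS = [
    {"ticker": "AAA", "eps_growth": 0.62, "dividend_growth": 0.55, "debt_ratio": 0.20},
    {"ticker": "BBB", "eps_growth": 0.70, "dividend_growth": 0.10, "debt_ratio": 0.15},
    {"ticker": "CCC", "eps_growth": 0.90, "dividend_growth": 0.80, "debt_ratio": 0.60},
]


def high_growth(stock):
    """The slide's concept: EPS growth > 50% and dividend growth > 50%."""
    return stock["eps_growth"] > 0.5 and stock["dividend_growth"] > 0.5


def safe_high_growth(stock):
    """A hypothetical concept defined on top of an existing one."""
    return high_growth(stock) and stock["debt_ratio"] < 0.3


def select(records, concept):
    """Prepare the subset of records that satisfies a concept."""
    return [r["ticker"] for r in records if concept(r)]


print(select(STOCKS, high_growth))
print(select(STOCKS, safe_high_growth))
```

The subsets returned by `select` play the role of the prepared data that later analysis steps would consume.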

Recon Rule Induction:

• User defines target concepts

• Learn a set of rules for the target concepts

• Has heuristics for modifying existing rules

• Example:

– If a stock is high-growth at time t, then its return on investment two quarters later will be greater than 20%

DBMiner Architecture

[Architecture diagram. Components: Graphical User Interface; Discovery Module; SQL Server; Database; Concept Hierarchy]

DBMiner Functionalities

• Inputs: Databases and Concept Hierarchy

• Outputs:

– Characteristic rules (hypothesis => evidence)

– Discriminant rules (evidence => hypothesis)

– Multi-level association rules

DBMiner Key Idea

• Attribute-Oriented Induction

– Organize values of each attribute into a hierarchy of concepts

– Perform rule induction at certain “prime” level in the hierarchies

• learn rules at a
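
The generalization step behind attribute-oriented induction can be sketched as follows. This is an illustrative Python sketch under assumptions, not DBMiner's implementation: the concept hierarchy, the attribute names, and the rows are hypothetical, and the sketch climbs exactly one level instead of choosing a "prime" level.

```python
from collections import Counter

# Hypothetical one-level concept hierarchies: attribute value -> parent concept.
HIERARCHY = {
    "major": {"physics": "science", "biology": "science",
              "music": "arts", "painting": "arts"},
    "city": {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"},
}

# Hypothetical rows over the attributes (major, city).
ROWS = [
    ("physics", "Vancouver"),
    ("biology", "Toronto"),
    ("music", "Seattle"),
    ("painting", "Seattle"),
    ("physics", "Toronto"),
]


def generalize(rows, attributes, hierarchy):
    """Replace each value by its parent concept and merge identical tuples.

    The count attached to each generalized tuple is what a rule learner
    would use as support at that level of the hierarchy.
    """
    counts = Counter()
    for row in rows:
        general = tuple(hierarchy[a].get(v, v) for a, v in zip(attributes, row))
        counts[general] += 1
    return counts


generalized = generalize(ROWS, ("major", "city"), HIERARCHY)
for tuple_, count in generalized.items():
    print(tuple_, count)
```

Five specific rows collapse into two generalized tuples, which is the compression that makes rule induction at the higher level tractable.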

DataCrystal (KnowledgeMiner)

• A common-representation language

– “Metapatterns”

• An integrated, efficient search engine

– “The Discovery Loop”

Metapatterns

• Specifications for type and form of pattern

• An example of metapattern

P(X,Y) & Q(Y,Z) => R(X,Z)

• Examples of discovered patterns

citizen(X,Y) & officialLanguage(Y,Z) => speaks(X,Z) [0.98]

parent(X,Y) & ancestor(Y,Z) => ancestor(X,Z) [0.99]

• Other Metapatterns

Ingredients(X, a, b) & Property(X,Y) => Cluster(Y)

connects(C,D) & Feature(C,X) & Feature(D,Y) => eql(X,Y)
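
To make instantiation concrete, here is a small Python sketch (an assumption-laden illustration, not DataCrystal's code) that fills the transitive metapattern P(X,Y) & Q(Y,Z) => R(X,Z) with every ordered triple of distinct binary relations and reports the fraction of left-hand-side bindings for which the right-hand side holds. The relations and their contents are hypothetical toy data, and the plain fraction stands in for whatever strength measure the system actually uses.

```python
from itertools import permutations

# Hypothetical binary relations, each a set of (first, second) tuples.
RELATIONS = {
    "citizen": {("ann", "us"), ("bob", "fr"), ("eve", "fr")},
    "officialLanguage": {("us", "english"), ("fr", "french")},
    "speaks": {("ann", "english"), ("bob", "french"), ("eve", "english")},
}


def instantiate_transitive(relations):
    """Try P(X,Y) & Q(Y,Z) => R(X,Z) for all distinct relation triples.

    Returns {(P, Q, R): (fraction of bindings satisfying R, number of bindings)}
    and skips instantiations whose left-hand side has no bindings.
    """
    results = {}
    for p, q, r in permutations(relations, 3):
        bindings = [(x, y, z)
                    for (x, y) in relations[p]
                    for (y2, z) in relations[q]
                    if y == y2]
        if not bindings:
            continue
        hits = sum(1 for (x, _, z) in bindings if (x, z) in relations[r])
        results[(p, q, r)] = (hits / len(bindings), len(bindings))
    return results


patterns = instantiate_transitive(RELATIONS)
for names, (fraction, count) in patterns.items():
    print(names, round(fraction, 2), count)
```

Because `permutations` only produces distinct relations, recursive patterns such as parent & ancestor => ancestor, where R is the same relation as Q, are outside this sketch.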

The Discovery Loop

[Diagram. Components: Knowledge Base; Metapattern Generator; Metapatterns, e.g. P(X,Y) & Q(Y,Z) => R(X,Z); Discovered Patterns, e.g. citizen(X,Y) & officialLanguage(Y,Z) => speaks(X,Z); Data Queries; Deductive DB; Inductive Actions (computeStrength, supervised learning, clustering, case-based reasoning, regression analysis, visualization)]

DataCrystal Applications

• Discover common-sense regularities from a large knowledge base (MCC)

– goodStudent(X,Y), taughtBy(Y,Z) => likedBy(X,Z) [0.99]

• Find circuit patterns from a telecommunication database (Bellcore)

– connect(X,’cab’,Y,’ept’), endLoc(X,U), loc(Y,V) => eql(U,V) [0.98]

• Build prediction models from a chemical research database (Eastman Chemical)

– percentage(X,’g306’,Y), density(X,W) => F35(Y,W)

• Construct fault-detection rules from a semiconductor manufacture control database (Motorola)

– receipt(W,2), p41(W,Y), time(W,179) => allowedVariance(0.9,3.4)

Metapattern Generation

• Metapatterns are hard to design

– A time-consuming interactive process

• Challenges

– No pre-labeled examples

– No pre-specified concepts

– Mostly relational concepts

– Unsupervised Learning of relational patterns

• So we need to generate metapatterns automatically

The Algorithm

• Inputs: schema, value ranges, thresholds, and domain knowledge (optional)

• Outputs: relational patterns

• Three main steps

– Step 1. Find connections among tables

• relational patterns can only be found among connected tables

– Step 2. Generate transitive metapatterns

• transitive patterns constitute a very interesting subset of relational patterns (implication, inheritance, transfer through, functional dependency)

– Step 3. Generate other metapatterns based on previous metapatterns

Step 1. Find connections

• Identify columns that are significantly connected

– two columns are significantly connected if they have the same type and their ranges overlap significantly

– domain knowledge can be used here for

• eliminating unnecessary connections (e.g., length, width)

• establishing syntactically different connections (e.g., color, frequency)

• Construct the significant connection table (SCT)

– a reference name is created for each connected pair

– the reference names and the table names are used as rows and columns of the SCT

An Abstract DB Example

Schema and value ranges

T1: C11 char(2), C12 integer[1-9], C13 float[0.1-0.9]

T2: C21 integer[11-19], C22 float[0.1-0.9], C23 char(3)

T3: C31 integer[11-19], C32 char(2)

T4: C41 char(3), C42 float[0.0-0.1], C43 integer[1-9]

Abstract DB Data Tables

T1:
C11  C12  C13
MM   8    0.6
TT   4    0.5
UU   5    0.7
KK   2    0.4
LL   9    0.5
QQ   5    0.8
JJ   4    0.8
MM   5    0.7
JJ   5    0.7
OO   5    0.5
OO   5    0.9
OO   3    0.4
JJ   6    0.2
NN   3    0.3

T2:
C21  C22  C23
17   0.6  GGG
16   0.8  JJJ
15   0.2  NNN
16   0.7  PPP
13   0.8  TTT
11   0.5  KKK
14   0.6  CCC
13   0.4  KKK
12   0.5  OOO
14   0.4  OOO

T3:
C31  C32
15   KK
18   LL
16   OO
18   JJ
18   HH
16   MM
15   KK
14   TT
16   FF
15   LL

T4:
C41  C42  C43
KKK  0.1  3
SSS  0.0  7
OOO  0.0  4
PPP  0.0  5
OOO  0.0  5
EEE  0.0  4
LLL  0.0  6
MMM  0.1  4
NNN  0.1  3
SSS  0.0  3
QQQ  0.1  7
KKK  0.1  6
LLL  0.0  6
DDD  0.1  9
OOO  0.1  5

DB Example Continued ...

Significant Connection Table

      T1    T2    T3    T4
X1    C13   C22
X2    C11               C32
X3    C12                     C43
X4          C21   C31
X5          C23         C41
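
A minimal Python sketch of Step 1 applied to the abstract schema. The slides do not define "overlap significantly", so the overlap measure (shared range length divided by the shorter range) and the 0.5 threshold are assumptions, as is treating char columns of equal width as connected because the schema gives them no range. Reference names are assigned in discovery order, so the numbering differs from the slide even though the connected pairs are the same.

```python
from itertools import combinations

# Schema from the abstract DB example: column -> (type, value range or None).
SCHEMA = {
    "T1": {"C11": ("char(2)", None), "C12": ("integer", (1, 9)), "C13": ("float", (0.1, 0.9))},
    "T2": {"C21": ("integer", (11, 19)), "C22": ("float", (0.1, 0.9)), "C23": ("char(3)", None)},
    "T3": {"C31": ("integer", (11, 19)), "C32": ("char(2)", None)},
    "T4": {"C41": ("char(3)", None), "C42": ("float", (0.0, 0.1)), "C43": ("integer", (1, 9))},
}


def overlap_ratio(r1, r2):
    """Assumed measure: length of the shared range over the shorter range."""
    low, high = max(r1[0], r2[0]), min(r1[1], r2[1])
    shorter = min(r1[1] - r1[0], r2[1] - r2[0])
    if shorter <= 0:
        return 1.0 if low <= high else 0.0
    return max(0.0, high - low) / shorter


def significantly_connected(spec1, spec2, threshold=0.5):
    """Same type, and ranges overlapping by at least the assumed threshold."""
    (type1, range1), (type2, range2) = spec1, spec2
    if type1 != type2:
        return False
    if range1 is None or range2 is None:
        return True  # assumption: no range given, so the type match decides
    return overlap_ratio(range1, range2) >= threshold


def build_sct(schema, threshold=0.5):
    """Return {reference name: {table: column, table: column}}."""
    columns = [(t, c, spec) for t, cols in schema.items() for c, spec in cols.items()]
    sct = {}
    for (t1, c1, s1), (t2, c2, s2) in combinations(columns, 2):
        if t1 != t2 and significantly_connected(s1, s2, threshold):
            sct[f"X{len(sct) + 1}"] = {t1: c1, t2: c2}
    return sct


for name, row in build_sct(SCHEMA).items():
    print(name, row)
```

Note that C42 (float[0.0-0.1]) touches the other float columns only at the single value 0.1, so under this measure it is not connected, which agrees with its absence from the table.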

Step 2: Generate Metapatterns

• Convert SCT to a graph G

• Find all predicate cycles in G

• Generate the complete set of transitive metapatterns

DB Example Continued ...

A Graph G constructed from SCT

(T1, X1) -- (T2, X1)

(T1, X2) -- (T3, X2)

(T1, X3) -- (T4, X3)

(T2, X4) -- (T3, X4)

(T2, X5) -- (T4, X5)

DB Example Continued ...

All Predicate Cycles found in G

(T2 X1 X4) (T3 X4 X2) (T1 X2 X1)

(T2 X1 X5) (T4 X5 X3) (T1 X3 X1)

(T2 X5 X1) (T1 X1 X2) (T3 X2 X4) (T2 X4 X5)

(T2 X4 X5) (T4 X5 X3) (T1 X3 X1) (T2 X1 X4)

(T1 X3 X1) (T2 X1 X4) (T3 X4 X2) (T1 X2 X3)

(T3 X2 X4) (T2 X4 X5) (T4 X5 X3) (T1 X3 X2)

(T1 X2 X3) (T4 X3 X5) (T2 X5 X1) (T1 X1 X2)

(T2 X1 X4) (T3 X4 X2) (T1 X2 X3) (T4 X3 X5) (T2 X5 X1)

(T1 X1 X2) (T3 X2 X4) (T2 X4 X5) (T4 X5 X3) (T1 X3 X1)
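
The slides do not spell out how predicate cycles are enumerated. As a partial illustration only, the Python sketch below treats tables as nodes and SCT reference names as labeled edges, and enumerates the simple cycles, in which each table appears once. It recovers the two three-table cycles and the cycle through all four tables from the list above; the listed cycles that use a table more than once are not covered by this simplification.

```python
# SCT from the abstract DB example: reference name -> the two connected tables.
SCT = {
    "X1": ("T1", "T2"),
    "X2": ("T1", "T3"),
    "X3": ("T1", "T4"),
    "X4": ("T2", "T3"),
    "X5": ("T2", "T4"),
}


def simple_cycles(sct):
    """Enumerate simple cycles as lists of predicates (table, in_ref, out_ref)."""
    adjacency = {}
    for label, (u, v) in sct.items():
        adjacency.setdefault(u, []).append((v, label))
        adjacency.setdefault(v, []).append((u, label))
    cycles = []

    def extend(start, nodes, labels):
        for nxt, label in adjacency[nodes[-1]]:
            if label in labels:
                continue  # never reuse a connection within one cycle
            if nxt == start:
                # Close the cycle; keep only one of its two orientations.
                if len(nodes) >= 3 and nodes[1] < nodes[-1]:
                    closed = labels + [label]
                    cycles.append([(n, closed[i - 1], closed[i])
                                   for i, n in enumerate(nodes)])
            elif nxt > start and nxt not in nodes:
                extend(start, nodes + [nxt], labels + [label])

    # Rooting each cycle at its smallest table avoids rotated duplicates.
    for start in sorted(adjacency):
        extend(start, [start], [])
    return cycles


cycles = simple_cycles(SCT)
for cycle in cycles:
    print(" ".join(f"({t} {a} {b})" for t, a, b in cycle))
```

Each predicate `(table, in_ref, out_ref)` reads like the slide's notation: the table joins the connection it was entered through to the connection it is left through.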

DB Example Continued ...

• The complete set of metapatterns

P1(Y1,Y2) & Q1(Y2,Y3) => R1(Y1,Y3)

P2(Y1,Y2) & Q2(Y2,Y3) & W2(Y3,Y4) => R2(Y1,Y4)

P3(Y1,Y2) & Q3(Y2,Y3) & W3(Y3,Y4) & V3(Y4,Y5) => R3(Y1,Y5)

Pattern Evaluation

• Evaluate each instantiated pattern p of metapattern P by

– Computing two values:

• strength: ps = prob(R | L,U,I) = (|R|+1) / (|L| + 2)

• base: pb = sqrt( (1 - ps) ps / N )

– Comparing with specified thresholds s and b:

if pb < b,
    then if (ps > s) or (ps < (1-s))
        then accept p
        else mark p as plausible
    else discard p

Examples of Evaluation

accept

(T2 X4 X1) (T3 X4 X2) (T1 X2 X3) => (T1 X3 X1) [0.8, 0.15]

(T1 X2 X1) (T3 X4 X2) (T2 X4 X5) (T4 X5 X3) => (T1 X3 X1) [0.9, 0.11]

plausible

(T1 X2 X3) (T4 X5 X3) (T2 X4 X5) => (T3 X4 X2) [0.5, 0.14]

discard

(T3 X4 X2) (T2 X4 X5) (T4 X5 X3) => (T1 X2 X3) [0.4, 0.9]

when s=0.8 and b=0.5
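
The evaluation rule can be written as a short Python sketch. The formulas and the decision rule follow the Pattern Evaluation slide; the function names are illustrative, and since N is not defined on the slides it is simply passed in as a parameter (presumably a sample size).

```python
import math


def strength(rhs_count, lhs_count):
    """ps = (|R| + 1) / (|L| + 2), as given on the slide."""
    return (rhs_count + 1) / (lhs_count + 2)


def base(ps, n):
    """pb = sqrt((1 - ps) * ps / N); N is an assumed sample-size parameter."""
    return math.sqrt((1 - ps) * ps / n)


def classify(ps, pb, s, b):
    """Apply the slide's thresholds: accept, mark as plausible, or discard."""
    if pb < b:
        if ps > s or ps < 1 - s:
            return "accept"
        return "plausible"
    return "discard"


# Bracketed [ps, pb] values from the examples, with s=0.8 and b=0.5.
for ps, pb in [(0.9, 0.11), (0.5, 0.14), (0.4, 0.9)]:
    print(ps, pb, classify(ps, pb, s=0.8, b=0.5))
```

The first accepted example, [0.8, 0.15], sits exactly on the threshold s=0.8, so the strict comparison would mark it plausible; the displayed strength is presumably a rounded value.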

Step 3. Propose More Metapatterns

• For each metapattern P that has many plausible patterns, do

– Select a (meta)constraint C and append it to the left hand side of P

• C must connect to at least one predicate in P

• C is a built-in predicate (e.g., =)

• C is suggested by the domain knowledge

• An Example

P1(Y1,Y2) & Q1(Y2,Y3) & S1(Y2,O) => R1(Y1,Y3)

A Small Network Example

Network Data Tables

Can-reach:
a1  a2
0   1
0   2
0   3
0   4
0   5
0   6
0   8
1   2
3   2
3   4
3   5
3   6
3   8
4   5
4   6
4   8
6   8
7   6
7   8

Linked-to:
b1  b2
0   1
0   3
1   2
3   2
3   4
4   5
4   6
6   8
7   6
7   8

Network Example Continued ...

Schema and Value Ranges

CAN-REACH: A1 integer[0-8], A2 integer[0-8]

LINKED-TO: B1 integer[0-8], B2 integer[0-8]

Significant Connection Table

      CAN-REACH   LINKED-TO
X1    A1          B1
X2    A2          B1
X3    A2          B2

Network Example Continued ...

The SCT Graph

(CR, X1) -- (LT, X1)

(CR, X2) -- (LT, X2)

(CR, X3) -- (LT, X3)

Network Example Continued ...

All Predicate Cycles

(LINKED-TO X1 X3) (CAN-REACH X1 X3)

(LINKED-TO X3 X1) (CAN-REACH X1 X2) (LINKED-TO X2 X3)

(CAN-REACH X1 X2) (LINKED-TO X2 X3) (CAN-REACH X3 X1)

Evaluate against DB

(LINKED-TO X1 X3) => (CAN-REACH X1 X3) [1.0, 10]

(CAN-REACH X1 X2) (LINKED-TO X2 X3) => (CAN-REACH X1 X3) [1.0, 11]

(CAN-REACH X1 X2) (CAN-REACH X1 X3) => (LINKED-TO X2 X3) [0.1, 89]

(CAN-REACH X1 X3) (LINKED-TO X2 X3) => (CAN-REACH X1 X2) [0.4, 31]

(CAN-REACH X1 X3) => (LINKED-TO X1 X3) [0.5, 19]
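
Three of these evaluations can be reproduced with a few lines of Python. This is a sketch under assumptions: the tuples are transcribed from the Network Data Tables slide, the helper `evaluate` is illustrative, and the first reported number is the plain fraction of left-hand-side bindings for which the right-hand side holds, not the smoothed strength formula. Under that reading, the second bracketed number is the count of left-hand-side bindings (10, 11, and 19 below).

```python
# Tuples transcribed from the Network Data Tables slide.
LINKED_TO = {(0, 1), (0, 3), (1, 2), (3, 2), (3, 4),
             (4, 5), (4, 6), (6, 8), (7, 6), (7, 8)}
CAN_REACH = {(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 8),
             (1, 2), (3, 2), (3, 4), (3, 5), (3, 6), (3, 8),
             (4, 5), (4, 6), (4, 8), (6, 8), (7, 6), (7, 8)}


def evaluate(lhs_bindings, rhs_holds):
    """Return (fraction of LHS bindings satisfying the RHS, number of bindings)."""
    bindings = list(lhs_bindings)
    hits = sum(1 for binding in bindings if rhs_holds(binding))
    return hits / len(bindings), len(bindings)


# LINKED-TO(x, z) => CAN-REACH(x, z)
linked_implies_reach = evaluate(LINKED_TO, lambda b: b in CAN_REACH)

# CAN-REACH(x, y) & LINKED-TO(y, z) => CAN-REACH(x, z)
transitive = evaluate(
    ((x, y, z) for (x, y) in CAN_REACH for (y2, z) in LINKED_TO if y == y2),
    lambda b: (b[0], b[2]) in CAN_REACH,
)

# CAN-REACH(x, z) => LINKED-TO(x, z)
reach_implies_linked = evaluate(CAN_REACH, lambda b: b in LINKED_TO)

print(linked_implies_reach, transitive, reach_implies_linked)
```

The first two patterns hold for every binding, which is what one expects if CAN-REACH is the transitive closure of LINKED-TO, while the converse holds for only 10 of 19 tuples.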

Characteristics of Metapattern Generation

• Unsupervised learning of relational (transitivity) patterns

– with no pre-specified concepts

– with no pre-labeled examples

– that have probabilistic significance

– directly from databases
