EECS 800 Research Seminar Mining Biological Data


Page 1: EECS 800 Research Seminar Mining Biological Data

The UNIVERSITY of Kansas

EECS 800 Research Seminar Mining Biological Data

Instructor: Luke Huan

Fall, 2006

Page 2: EECS 800 Research Seminar Mining Biological Data


Overview

Rule-based classification method overview

CBA: classification based on association

Applying rule-based methods in microarray analysis

Page 3: EECS 800 Research Seminar Mining Biological Data


Rule-Based Methods vs. SVM

Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05, 2005.

Page 4: EECS 800 Research Seminar Mining Biological Data


Rule-Based Classifier

Classify records by using a collection of “if…then…” rules

Rule: (Condition) → y, where

Condition is a conjunction of attribute tests

y is the class label

Examples of classification rules:

(Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds

(Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
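
To make the representation concrete, here is a minimal sketch (not from the slides) of one way to encode such if-then rules in Python; it assumes a record is a dict of attribute values and handles only equality tests (a condition like Taxable Income < 50K would need a predicate instead):

def covers(conditions, record):
    """A rule covers a record if every attribute test in its condition holds."""
    return all(record.get(attr) == value for attr, value in conditions.items())

# Each rule is (condition, class label); the condition is a conjunction of attribute tests.
rules = [
    ({"Blood Type": "warm", "Lay Eggs": "yes"}, "Birds"),
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
]

record = {"Blood Type": "warm", "Lay Eggs": "yes", "Give Birth": "no"}
print([label for cond, label in rules if covers(cond, record)])  # ['Birds']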

Page 5: EECS 800 Research Seminar Mining Biological Data


Rule-based Classifier (Example)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Page 6: EECS 800 Research Seminar Mining Biological Data


Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

The rule R1 covers a hawk => Bird

The rule R3 covers the grizzly bear => Mammal

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Page 7: EECS 800 Research Seminar Mining Biological Data


Rule Coverage and Accuracy

Coverage of a rule: the fraction of records that satisfy the antecedent of the rule

Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent

(Status = Single) → No

Coverage = 40%, Accuracy = 50%

Tid  Home  Marital Status  Taxable Income  Cheat
1    Yes   Single          125K            No
2    No    Married         100K            No
3    No    Single          70K             No
4    Yes   Married         120K            No
5    No    Divorced        95K             Yes
6    No    Married         60K             No
7    Yes   Divorced        220K            No
8    No    Single          85K             Yes
9    No    Married         75K             No
10   No    Single          90K             Yes
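
To make the 40% / 50% numbers concrete, the following sketch computes coverage and accuracy of (Status = Single) → No over the ten records above; the data layout is just an illustrative assumption:

# Sketch: coverage and accuracy of (Status = Single) -> No on the table above.
records = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"), ("Single", "Yes"),
    ("Married", "No"), ("Single", "Yes"),
]  # (Marital Status, Cheat) for Tids 1..10

covered = [cheat for status, cheat in records if status == "Single"]
coverage = len(covered) / len(records)         # 4/10 = 40%
accuracy = covered.count("No") / len(covered)  # 2/4  = 50%
print(coverage, accuracy)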

Page 8: EECS 800 Research Seminar Mining Biological Data


How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

A lemur triggers rule R3, so it is classified as a mammal

A turtle triggers both R4 and R5

A dogfish shark triggers none of the rules

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

Page 9: EECS 800 Research Seminar Mining Biological Data


Characteristics of Rule-Based Classifier

Mutually exclusive rules: every record is covered by at most one rule

Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values

Each record is covered by at least one rule

Page 10: EECS 800 Research Seminar Mining Biological Data


From Decision Trees To Rules

[Decision tree: the root splits on Refund (Yes → NO); Refund = No splits on Marital Status ({Single, Divorced} vs. {Married} → NO); {Single, Divorced} splits on Taxable Income (< 80K → NO, > 80K → YES).]

Classification Rules:

(Refund=Yes) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes

(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive

Rule set contains as much information as the tree

Page 11: EECS 800 Research Seminar Mining Biological Data


Rules Can Be Simplified

[Decision tree (as on the previous slide): the root splits on Refund (Yes → NO); Refund = No splits on Marital Status ({Single, Divorced} vs. {Married} → NO); {Single, Divorced} splits on Taxable Income (< 80K → NO, > 80K → YES).]

Initial Rule: (Refund = No) ∧ (Status = Married) → No

Simplified Rule: (Status = Married) → No

Tid  Home  Marital Status  Taxable Income  Cheat
1    Yes   Single          125K            No
2    No    Married         100K            No
3    No    Single          70K             No
4    Yes   Married         120K            No
5    No    Divorced        95K             Yes
6    No    Married         60K             No
7    Yes   Divorced        220K            No
8    No    Single          85K             Yes
9    No    Married         75K             No
10   No    Single          90K             Yes

Page 12: EECS 800 Research Seminar Mining Biological Data


Effect of Rule Simplification

Rules are no longer mutually exclusive: a record may trigger more than one rule

Solution?

Ordered rule set

Unordered rule set – use voting schemes

Rules are no longer exhaustive: a record may not trigger any rule

Solution?

Use a default class

Page 13: EECS 800 Research Seminar Mining Biological Data


Ordered Rule Set

Rules are rank-ordered according to their priority. An ordered rule set is known as a decision list.

When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers.

If none of the rules fires, it is assigned to the default class.

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
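
A decision list can be applied with a simple first-match loop; the sketch below is illustrative (names and the record layout are assumptions) and returns the default class when no rule fires. For the turtle record it returns Reptiles, since R4 is ranked above R5:

# Sketch: classification with an ordered rule set (decision list).
def classify(decision_list, record, default_class):
    for conditions, label in decision_list:          # rules in priority order
        if all(record.get(a) == v for a, v in conditions.items()):
            return label                             # first triggered rule wins
    return default_class                             # no rule fired

turtle = {"Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
decision_list = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),         # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),  # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),  # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),       # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),            # R5
]
print(classify(decision_list, turtle, "Unknown"))  # 'Reptiles' (R4 fires before R5)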

Page 14: EECS 800 Research Seminar Mining Biological Data


Rule Ordering Schemes

Rule-based ordering: individual rules are ranked based on their quality

Class-based ordering: rules that belong to the same class appear together

Rule-based Ordering

(Refund=Yes) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes

(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering

(Refund=Yes) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No

(Refund=No, Marital Status={Married}) ==> No

(Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes

Page 15: EECS 800 Research Seminar Mining Biological Data


Building Classification Rules

Direct Method: Extract rules directly from data

e.g.: CBA

Indirect Method: Extract rules from other classification models (e.g., decision trees, neural networks, etc.)

e.g.: C4.5rules

Page 16: EECS 800 Research Seminar Mining Biological Data


Direct Method: Sequential Covering

1. Start from an empty rule

2. Grow a rule using the Learn-One-Rule function

3. Remove training records covered by the rule

4. Repeat Steps (2) and (3) until the stopping criterion is met (a sketch of this loop follows)
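
The four steps can be sketched as a small driver loop; learn_one_rule is a placeholder for any rule-growing procedure (such as the general-to-specific growth on the later slides), and the rule representation is an assumption:

# Sketch of sequential covering; learn_one_rule is a stand-in (assumption) that
# returns (predicate, label) or None, where predicate(record) -> bool.
def sequential_covering(records, learn_one_rule, min_coverage=1):
    rules, remaining = [], list(records)
    while remaining:
        rule = learn_one_rule(remaining)                 # step 2: grow one rule
        if rule is None:
            break
        predicate, label = rule
        covered = [r for r in remaining if predicate(r)]
        if len(covered) < min_coverage:                  # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining if not predicate(r)]  # step 3: remove covered records
    return rules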

Page 17: EECS 800 Research Seminar Mining Biological Data


Example of Sequential Covering

[Figure: (i) Original Data; (ii) Step 1]

Page 18: EECS 800 Research Seminar Mining Biological Data


Example of Sequential Covering…

[Figure: (iii) Step 2 (rule R1); (iv) Step 3 (rules R1 and R2)]

Page 19: EECS 800 Research Seminar Mining Biological Data


Aspects of Sequential Covering

Rule Growing

Rule Evaluation

Stopping Criterion

Rule Pruning

Page 20: EECS 800 Research Seminar Mining Biological Data


Rule Growing

Two common strategies

[Figure (a) General-to-specific: start from the empty rule { } (covering Yes: 3, No: 4) and consider adding conjuncts such as Refund = No, Status = Single, Status = Divorced, Status = Married, or Income > 80K, comparing the class counts (Yes/No) of the records each candidate rule covers.]

[Figure (b) Specific-to-general: start from specific rules such as (Refund = No, Status = Single, Income = 85K) → Class = Yes and (Refund = No, Status = Single, Income = 90K) → Class = Yes, and generalize them to (Refund = No, Status = Single) → Class = Yes.]

Page 21: EECS 800 Research Seminar Mining Biological Data


Rule Growing (Examples)

RIPPER algorithm: start from an empty rule ({} => class) and add conjuncts that maximize FOIL's information gain measure:

R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)

Gain(R0, R1) = t * [ log(p1 / (p1 + n1)) - log(p0 / (p0 + n0)) ]

where
t:  number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
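
Written as code, the gain measure is a one-liner; this sketch assumes base-2 logarithms (the slide leaves the base unspecified) and that R1 is a specialization of R0, so t equals p1:

from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending R0 to R1, using the counts above.
    Assumes R1 specializes R0, so every positive covered by R1 is covered by R0
    and t = p1. Base-2 log is an assumption; the slide just writes 'log'."""
    t = p1
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))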

Page 22: EECS 800 Research Seminar Mining Biological Data


Rule Evaluation

Metrics:

Accuracy = nc / n

Laplace = (nc + 1) / (n + k)

M-estimate = (nc + k*p) / (n + k)

where
n  : number of instances covered by the rule
nc : number of instances covered by the rule that belong to the rule's class
k  : number of classes
p  : prior probability of the rule's class
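
A small sketch of the three metrics with the counts defined above (function names are illustrative):

def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# e.g., a rule covering n = 10 instances, nc = 8 of its class, with k = 2 classes
# and prior p = 0.5: accuracy 0.80, Laplace 0.75, m-estimate 0.75.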

Page 23: EECS 800 Research Seminar Mining Biological Data


Stopping Criterion and Rule Pruning

Stopping criterion: compute the gain; if the gain is not significant, discard the new rule

Rule pruning: similar to post-pruning of decision trees. Reduced error pruning:

Remove one of the conjuncts in the rule. Compare the error rate on the validation set before and after pruning. If the error improves, prune the conjunct.

Page 24: EECS 800 Research Seminar Mining Biological Data


Summary of Direct Method

Grow a single rule

Remove instances covered by the rule

Prune the rule (if necessary)

Add rule to Current Rule Set

Repeat

Page 25: EECS 800 Research Seminar Mining Biological Data


Direct Method: RIPPER

For a 2-class problem, choose one of the classes as the positive class and the other as the negative class

Learn rules for positive class

Negative class will be default class

For a multi-class problem: order the classes according to increasing class prevalence (the fraction of instances that belong to a particular class)

Learn the rule set for the smallest class first, treating the rest as the negative class

Repeat with next smallest class as positive class

Page 26: EECS 800 Research Seminar Mining Biological Data


Indirect Methods

Rule Set

r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +

[Decision tree: the root tests P; P = No leads to Q (No → -, Yes → +); P = Yes leads to R (No → +, Yes → another test on Q, where No → - and Yes → +).]

Page 27: EECS 800 Research Seminar Mining Biological Data


Indirect Method: C4.5rules

Extract rules from an unpruned decision tree. For each rule r: A → y,

consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate of r against all r'. Prune r if one of the r' has a lower pessimistic error rate. Repeat until the generalization error can no longer be improved.

Page 28: EECS 800 Research Seminar Mining Biological Data


Advantages of Rule-Based Classifiers

As highly expressive as decision trees

Easy to interpret

Easy to generate

Can classify new instances rapidly

Performance comparable to decision trees

Page 29: EECS 800 Research Seminar Mining Biological Data


Overview of CBA

Classification rule mining versus association rule mining

Aim:

Classification rule mining: a small set of rules used as a classifier

Association rule mining: all rules satisfying minsup and minconf

Syntax:

Classification rule: X → y (y is a class label)

Association rule: X → Y

Page 30: EECS 800 Research Seminar Mining Biological Data


Association Rules for Classification

Classification: mine a small set of rules existing in the data to form a classifier or predictor.

It has a target attribute: Class attribute

Association rules: have no fixed target, but we can fix a target. Class association rules (CARs): have a target class attribute. E.g.,

Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]

CARs can obviously be used for classification.

B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD, 1998. http://www.comp.nus.edu.sg/~dm2/

Page 31: EECS 800 Research Seminar Mining Biological Data


Decision tree vs. CARs

The decision tree generates the following 3 rules:

Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]

Own_house = false, Has_job = true → Class = Yes [sup = 5/15, conf = 5/5]

Own_house = false, Has_job = false → Class = No [sup = 4/15, conf = 4/4]

But there are many other rules that are not found by the decision tree

Page 32: EECS 800 Research Seminar Mining Biological Data


There are many more rules

CAR mining finds all of them. In many cases, rules not in the decision tree (or a rule list) may perform classification better. Such rules may also be actionable in practice

Page 33: EECS 800 Research Seminar Mining Biological Data


Decision tree vs. CARs (cont…)

Association rule mining requires discrete attributes. Decision tree learning uses both discrete and continuous attributes.

CAR mining requires continuous attributes to be discretized. There are several such discretization algorithms.

Decision tree learning is not constrained by minsup or minconf, and thus is able to find rules with very low support. Of course, such rules may be pruned to avoid possible overfitting.

Page 34: EECS 800 Research Seminar Mining Biological Data


CBA: Three Steps

Discretize continuous attributes, if any

Generate all class association rules (CARs)

Build a classifier based on the generated CARs.

Page 35: EECS 800 Research Seminar Mining Biological Data


RG (Rule Generation): The Algorithm

Find the complete set of all possible rules

This usually takes a long time to finish

Page 36: EECS 800 Research Seminar Mining Biological Data


RG: Basic Concepts

Frequent ruleitems: a ruleitem is frequent if its support is above minsup

Accurate rule: a rule is accurate if its confidence is above minconf

Possible rule: for all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems.

The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.
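
The definitions above can be turned into a brute-force sketch of rule generation: enumerate condsets, keep the highest-confidence class per condset (the possible rule), and retain those that are frequent and accurate. This is illustrative only; the actual CBA-RG algorithm uses an Apriori-style search rather than exhaustive enumeration, and the toy rows below are assumptions:

from itertools import combinations
from collections import Counter

def mine_cars(rows, minsup, minconf):
    """rows: list of (itemset, class_label). Returns (condset, label, sup, conf) tuples."""
    cars = []
    all_items = sorted({i for items, _ in rows for i in items})
    for size in range(1, len(all_items) + 1):
        for condset in combinations(all_items, size):
            covered = [label for items, label in rows if set(condset) <= set(items)]
            if not covered:
                continue
            label, count = Counter(covered).most_common(1)[0]  # possible rule for this condset
            sup, conf = count / len(rows), count / len(covered)
            if sup >= minsup and conf >= minconf:              # frequent and accurate
                cars.append((condset, label, sup, conf))
    return cars

rows = [({"a", "b", "c"}, "C1"), ({"a", "b", "c", "d"}, "C1"),
        ({"c", "d", "e"}, "C1"), ({"c", "d", "e"}, "C2")]
print(mine_cars(rows, minsup=0.25, minconf=0.9)[:3])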

Page 37: EECS 800 Research Seminar Mining Biological Data


Further Considerations in CAR mining

Multiple minimum class supports: deal with imbalanced class distributions, e.g., some class is rare (98% negative and 2% positive). We can set minsup(positive) = 0.2% and minsup(negative) = 2%. If we are not interested in classifying the negative class, we may not want to generate rules for it; we can set minsup(negative) = 100% or more.

Rule pruning may be performed.

Page 38: EECS 800 Research Seminar Mining Biological Data


Building Classifiers

There are many ways to build classifiers using CARs. Several existing systems are available.

Simplest: After CARs are mined, do nothing. For each test case, we simply choose the most confident rule that covers the test case to classify it. Microsoft SQL Server has a similar method.

Or, use a combination of rules.

Another method (used in the CBA system) is similar to sequential covering.

Choose a set of rules to cover the training data.

Page 39: EECS 800 Research Seminar Mining Biological Data


Class Builder: Three Steps

The basic idea is to choose a set of high precedence rules in R to cover D.

Sort the set of generated rules R. Select rules for the classifier from R following the sorted sequence and put them in C.

Each selected rule has to correctly classify at least one additional case. Also select the default class and compute errors.

Discard those rules in C that don’t improve the accuracy of the classifier.

Locate the rule with the lowest error rate and discard the rest of the rules in the sequence.

Page 40: EECS 800 Research Seminar Mining Biological Data


Rules are sorted first

Definition: Given two rules ri and rj, ri ≻ rj (also called ri precedes rj, or ri has a higher precedence than rj) if

the confidence of ri is greater than that of rj, or

their confidences are the same, but the support of ri is greater than that of rj, or

both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.

A CBA classifier L is of the form:

L = <r1, r2, …, rk, default-class>
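
The precedence relation translates directly into a sort key, with generation order as the final tie-breaker; the rule representation (a dict with conf, sup, and gen_order fields) is an assumption for illustration:

# Sketch: sort rules by CBA precedence (confidence, then support, then generation order).
def cba_sort(rules):
    return sorted(rules, key=lambda r: (-r["conf"], -r["sup"], r["gen_order"]))

rules = [{"conf": 0.9, "sup": 0.2, "gen_order": 2},
         {"conf": 0.9, "sup": 0.3, "gen_order": 1},
         {"conf": 1.0, "sup": 0.1, "gen_order": 3}]
print(cba_sort(rules))  # highest confidence first, then support, then generation order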

Page 41: EECS 800 Research Seminar Mining Biological Data


Classifier building using CARs

Selection: each rule makes at least one correct prediction; each case is covered by the rule with the highest precedence

This algorithm is correct but is not efficient
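
A sketch of that straightforward (but inefficient) selection loop: walk the rules in precedence order, keep a rule if it correctly classifies at least one still-uncovered case, remove the cases it covers, and pick a default class at the end. It omits CBA's error-based truncation step and uses an assumed (predicate, label) rule representation:

from collections import Counter

# Naive classifier-building sketch. Each rule: (predicate, label) with predicate(record) -> bool;
# each case: (record, true_label).
def build_classifier(sorted_rules, cases):
    classifier, remaining = [], list(cases)
    for predicate, label in sorted_rules:                # rules in precedence order
        covered = [c for c in remaining if predicate(c[0])]
        if any(true_label == label for _, true_label in covered):   # at least one correct prediction
            classifier.append((predicate, label))
            remaining = [c for c in remaining if not predicate(c[0])]
    pool = remaining if remaining else cases
    default = Counter(lbl for _, lbl in pool).most_common(1)[0][0]  # majority class as default
    return classifier, default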

Page 42: EECS 800 Research Seminar Mining Biological Data


Classifier building using CARs

For each case d in D: coverRules(d) = all rules covering d

Sort D according to the precedence of the first correctly predicting rule of each case d

RuleSet = empty

Scan D again to find optimal rule set.

Page 43: EECS 800 Research Seminar Mining Biological Data


Refined Classification Based on TopkRGS (RCBT)

General idea: construct the RCBT classifier from top-k covering rule groups, so the number of rule groups generated is bounded.

Efficiency and accuracy are validated by experimental results

Based on Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05

Page 44: EECS 800 Research Seminar Mining Biological Data


Dataset

In the microarray dataset:

Each row in the dataset corresponds to a sample

Each item value in the dataset corresponds to a discretized gene expression value

Class labels correspond to the category of the sample (e.g., cancer / not cancer)

Useful for diagnostic purposes

Page 45: EECS 800 Research Seminar Mining Biological Data


Introduction of gene expression data (Microarray)

Format of gene expression data: columns are genes (thousands of them); rows are samples with class labels (tens or hundreds of patients)

        gene1  gene2  gene3  gene4  Class
row1    10.6   1.9    33     22     C1
row2    18.6   5.6    56     10     C1
row3    5.7    4.3    133    22     C2

(discretize)

        items       Class
row1    a, b, e, h  C1
row2    c, d, e, f  C1
row3    a, b, g, h  C2

Page 46: EECS 800 Research Seminar Mining Biological Data


Rule Example

Rule r: {a,e,h} -> C

Support(r)=3

Confidence(r) = 66%

Page 47: EECS 800 Research Seminar Mining Biological Data


RG: General solution

Step 1: Find all frequent itemsets in dataset D.

Step 2: Generate rules of the form itemset -> C. Prune rules that do not have enough support and confidence.

Page 48: EECS 800 Research Seminar Mining Biological Data


RG: Previous Algorithms

Item enumeration: Search all the frequent itemsets by checking all possible combinations of items.

{ }
{a}   {b}   {c}
{ab}  {ac}  {bc}
{abc}

We can simulate the search process in an item enumeration tree.

Page 49: EECS 800 Research Seminar Mining Biological Data


Microarray data

Features of microarray data: a few rows (100-1000)

A large number of items: ~10,000

The space of all combinations of items is huge: 2^10000.

Page 50: EECS 800 Research Seminar Mining Biological Data


Motivations

Existing rule mining algorithms are very slow: the item search space is exponential in the number of items

Use the idea of row enumeration to design a new algorithm

The number of association rules is too huge, even for a given consequent

Mine the top-k interesting rule groups for each row

Page 51: EECS 800 Research Seminar Mining Biological Data


Definitions

Row support set: given a set of items I', we denote by R(I') the largest set of rows that contain I'.

Item support set: given a set of rows R', we denote by I(R') the largest set of items that are common among the rows in R'.

Page 52: EECS 800 Research Seminar Mining Biological Data


Example

I’={a,e,h}, then R(I’)={r2,r3,r4}

R’={r2,r3}, then I(R’)={a,e,h}

Page 53: EECS 800 Research Seminar Mining Biological Data


Rule Groups

What is a rule group? Given a one-row dataset {a, b, c, d, e, Cancer}, there are 31 rules of the form LHS → Cancer.

All of them cover the same row and have the same confidence (100%).

There is 1 upper bound rule and 5 lower bound rules.

Rule group: a set of association rules whose LHS itemsets occur in the same set of rows. A rule group has a unique upper bound, e.g., abcde → Cancer.

Page 54: EECS 800 Research Seminar Mining Biological Data


Rule groups: example

Rule group: a set of rules covered by the same set of rows.

Upper bound rule: abc → C1 (100%)

Other rules in the group: ab → C1 (100%), ac → C1 (100%), bc → C1 (100%), a → C1 (100%), b → C1 (100%)

c → C1 is not in the group

Class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e

Page 55: EECS 800 Research Seminar Mining Biological Data


Significance of RGs

Rule group r1 is more significant than r2 if:

r1.conf > r2.conf, or

r1.conf = r2.conf and r1.sup > r2.sup

Page 56: EECS 800 Research Seminar Mining Biological Data


Finding top-k rule groups

Given dataset D, for each row of the dataset, find the k most significant covering rule groups (represented by the upper bounds) subject to the minimum support constraint

No min-confidence required

Page 57: EECS 800 Research Seminar Mining Biological Data


Top-k covering rule groups

For each row, we find the most significant k rule groups:

based on confidence first then support

Given minsup = 1, the top-1 covering rule groups are:

row 1: abc → C1 (sup = 2, conf = 100%)

row 2: abc → C1, abcd → C1 (sup = 1, conf = 100%)

row 3: cd → C1 (sup = 2, conf = 66.7%)

row 4: cde → C2 (sup = 1, conf = 50%)

Class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e

Page 58: EECS 800 Research Seminar Mining Biological Data


Relationship with CBA

The rules selected by CBA for classification are a subset of the rules of TopkRGS with k=1

So we can use top-1 covering rule groups to build the CBA classifier

The authors also proposed a refined classification method based on TopkRGS

Reduces the chance that test data is classified by a default class;

Uses a subset of rules to make a collective decision

Page 59: EECS 800 Research Seminar Mining Biological Data


Main advantages of Top-k covering rule groups

The number is bounded by the product of k and the number of samples

Treats each sample equally and provides a complete (and small) description for each row

No minimum confidence parameter is needed; k is specified instead

Sufficient to build classifiers while avoiding excessive computation

Page 60: EECS 800 Research Seminar Mining Biological Data


Naïve method of finding TopkRGS

Find the complete set of upper bound rules in the dataset using the row-wise algorithm.

Pick the top-k covering rule groups for each row in the dataset.

This is inefficient. To improve it, keep track of the top-k rule groups at each enumeration node dynamically and use effective pruning strategies.

Page 61: EECS 800 Research Seminar Mining Biological Data


References

W. Cohen. Fast Effective Rule Induction. ICML'95.

X. Yin and J. Han. CPAR: Classification Based on Predictive Association Rules. SDM'03.

B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.

J. Li. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006.

Some slides are provided by Zhang Xiang (http://www.cs.unc.edu/~xiang/).